Annotating the biomedical literature for the human variome

Authors

  • Karin M. Verspoor
  • Antonio Jimeno-Yepes
  • Lawrence Cavedon
  • Tara McIntosh
  • Asha Herten-Crabb
  • Zoë Thomas
  • John-Paul Plazzer
Abstract

This article introduces the Variome Annotation Schema, a schema that aims to capture the core concepts and relations relevant to cataloguing and interpreting human genetic variation and its relationship to disease, as described in the published literature. The schema was inspired by the needs of the database curators of the International Society for Gastrointestinal Hereditary Tumours (InSiGHT) database, but is intended to have application to genetic variation information in a range of diseases. The schema has been applied to a small corpus of full text journal publications on the subject of inherited colorectal cancer. We show that the inter-annotator agreement on annotation of this corpus ranges from 0.78 to 0.95 F-score across different entity types when exact matching is measured, and improves to a minimum F-score of 0.87 when boundary matching is relaxed. Relations show more variability in agreement, but several are reliable, with the highest, cohort-has-size, reaching 0.90 F-score. We also explore the relevance of the schema to the InSiGHT database curation process. The schema and the corpus represent an important new resource for the development of text mining solutions that address relationships among patient cohorts, disease and genetic variation, and therefore, we also discuss the role text mining might play in the curation of information related to the human variome. The corpus is available at http://opennicta.com/home/health/variome.
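The difference between exact and relaxed boundary matching mentioned above can be illustrated with a minimal sketch. It assumes that annotations are character-offset spans with a type label, that "relaxed" means any overlap between spans of the same type, and that one annotator is treated as the reference and the other as the response. The `Span` type, the function names, the labels, and the offsets are hypothetical illustrations, not the authors' evaluation code or data from the corpus.

```python
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # entity type (illustrative labels only)


def exact_match(a: Span, b: Span) -> bool:
    # Same type and identical boundaries.
    return a == b


def overlap_match(a: Span, b: Span) -> bool:
    # Same type and at least one shared character (assumed relaxed criterion).
    return a.label == b.label and a.start < b.end and b.start < a.end


def f_score(reference: List[Span], response: List[Span],
            match: Callable[[Span, Span], bool]) -> float:
    # Precision: fraction of response spans matched by some reference span.
    # Recall: fraction of reference spans matched by some response span.
    if not reference or not response:
        return 0.0
    matched_response = sum(any(match(r, s) for r in reference) for s in response)
    matched_reference = sum(any(match(r, s) for s in response) for r in reference)
    precision = matched_response / len(response)
    recall = matched_reference / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of the same text by two annotators.
annotator_a = [Span(0, 5, "gene"), Span(10, 20, "mutation"), Span(30, 42, "disease")]
annotator_b = [Span(0, 5, "gene"), Span(12, 20, "mutation"), Span(50, 55, "disease")]

exact_f = f_score(annotator_a, annotator_b, exact_match)
relaxed_f = f_score(annotator_a, annotator_b, overlap_match)
print(f"exact: {exact_f:.2f}, relaxed: {relaxed_f:.2f}")
```

In this toy example the two annotators disagree only on the left boundary of the second span, so that pair counts as a miss under exact matching and as a hit under overlap matching. This is the kind of effect that raises agreement when boundary matching is relaxed.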

Similar resources

Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts

BACKGROUND The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction syste...

Metab2MeSH: annotating compounds with medical subject headings

SUMMARY Progress in high-throughput genomic technologies has led to the development of a variety of resources that link genes to functional information contained in the biomedical literature. However, tools attempting to link small molecules to normal and diseased physiology and published data relevant to biologists and clinical investigators, are still lacking. With metabolomics rapidly emergi...

Impact of Corpus Diversity and Complexity on NER Performance

We describe a cross-corpora evaluation of disease mention recognition for two annotated biomedical corpora: the Human Variome Project Corpus and the Arizona Disease Corpus. Our analysis of the performance of a state-of-the-art NER tool in terms of the characteristics and annotation schema of these corpora shows that these factors significantly affect performance.

Design of a Standoff Object-Oriented Markup Language (sooml) for Annotating Biomedical Literature

With the rapid growth of electronically available scientific literature, text mining is attracting increasing attention. While numerous algorithms, tools, and systems have been developed for extracting information from text, little effort has been focused on how to mark up the information. We present the design of a standoff, object-oriented markup language (called SOOML), which is simple, expr...

Journal title:

Volume 2013, Issue

Pages -

Publication date: 2013